
Allow truncation when embedding #14493

Open · wants to merge 5 commits into base: master

Conversation

@huydt84 (Collaborator) commented Jul 2, 2025

Sometimes it is frustrating that llama-server automatically stops when the input token length exceeds slot.n_ctx in an embedding task. I want it to be able to truncate the input tokens instead, as an option.
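For illustration, here is a minimal, self-contained sketch of the requested behaviour: clamp the tokenized input to the slot's context size instead of failing the request. This is not the actual patch; the struct, field, and function names below are made up for the example.

```cpp
#include <cstdio>
#include <vector>

// Stand-in for the relevant parts of a server slot; the real server_slot in
// llama-server has many more fields.
struct embed_slot {
    int  n_ctx = 0;                 // context size available to this slot
    bool truncated = false;         // set when the input had to be cut
    std::vector<int> prompt_tokens; // tokenized embedding input
};

// Clamp the prompt to the slot's context size instead of failing the request.
// Returns false when the input does not fit and truncation is not allowed,
// which corresponds to the current behaviour of stopping the task.
static bool fit_prompt(embed_slot & slot, bool allow_truncate) {
    if ((int) slot.prompt_tokens.size() <= slot.n_ctx) {
        return true; // input already fits
    }
    if (!allow_truncate) {
        return false; // current behaviour: reject the embedding request
    }
    slot.prompt_tokens.resize(slot.n_ctx); // keep only the first n_ctx tokens
    slot.truncated = true;                 // so the response can flag partial coverage
    return true;
}

int main() {
    embed_slot slot;
    slot.n_ctx         = 4;
    slot.prompt_tokens = {1, 2, 3, 4, 5, 6};

    const bool ok = fit_prompt(slot, /*allow_truncate=*/true);
    std::printf("ok=%d kept=%zu truncated=%d\n",
                (int) ok, slot.prompt_tokens.size(), (int) slot.truncated);
    return 0;
}
```

With truncation enabled, the returned embedding covers only the first n_ctx tokens of the original input, which the response should signal (for example via the truncated flag above).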

@huydt84 requested a review from ngxson as a code owner on July 2, 2025 at 04:31
@huydt84 (Collaborator, Author) commented Jul 4, 2025

@ngxson Please check

@ngxson (Collaborator) left a comment

This can be an intermediate solution for now, but I think the whole embedding system needs to be reworked at some point.

Currently, we don't have to process the whole input in one single batch if we're using last pooling with causal attention. Having this can be useful for newer models like Qwen, while also unlocking the use of a single model for both text generation and embeddings. @ggerganov WDYT?

Comment on lines +2587 to 2589
// Note: If the input was truncated (slot.truncated == true), this embedding
// represents only the processed portion of the original input
for (int i = 0; i < batch.n_tokens; ++i) {
A collaborator commented:

An easier way to avoid touching this loop, which is becoming more and more fragile, is to simply handle this truncation in launch_slot_with_task. You can truncate slot.prompt_tokens right after it has been std::move'd.
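A rough sketch of that placement, with stand-in types; the real launch_slot_with_task and the server_slot/server_task structs in llama-server are far more involved, and the option flag here is hypothetical.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

struct task_stub { std::vector<int> prompt_tokens; };

struct slot_stub {
    int  n_ctx = 0;
    bool truncated = false;
    std::vector<int> prompt_tokens;
};

// Truncate right where the tokens are moved into the slot, so the batching
// loop later in the pipeline never sees an oversized prompt and stays untouched.
static void launch_slot_with_task_stub(slot_stub & slot, task_stub && task, bool allow_truncate) {
    slot.prompt_tokens = std::move(task.prompt_tokens);
    if (allow_truncate && (int) slot.prompt_tokens.size() > slot.n_ctx) {
        slot.prompt_tokens.resize(slot.n_ctx);
        slot.truncated = true;
    }
    // ... the rest of the slot initialization would follow here ...
}

int main() {
    slot_stub slot;
    slot.n_ctx = 3;
    launch_slot_with_task_stub(slot, task_stub{{10, 20, 30, 40}}, /*allow_truncate=*/true);
    std::printf("kept=%zu truncated=%d\n", slot.prompt_tokens.size(), (int) slot.truncated);
    return 0;
}
```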

@ggerganov (Member) commented:

I agree that the embeddings handling in llama-server and in libllama needs some updates. I'll probably take a look soon. The main problem has always been that there are so many different embedding modes that it is hard to understand what is needed, especially without having specific use cases at hand.

This change specifically does not seem to improve the situation: we introduce another special case with an extra parameter and branches. Better to implement this properly later.
